# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')
This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook we explore approaches to predicting future temperatures by using the time-series dataset.
Before you run this notebook complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
More -> Insert project token
in the top-right menu sectionThis should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below
Import and configure the required modules.
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX
from matplotlib import pyplot as plt
We start by reading the cleaned dataset that was created in the project notebook Part 1 - Data Cleaning
.
Note: if you haven't yet run this notebook, run it first; otherwise the cells below will not work.
def get_file_handle(fname):
# Project data path for the raw data file
data_path = project.get_file(fname)
data_path.seek(0)
return data_path
# Using pandas to read the data
# Since the `DATE` column consists date-time information, we use Pandas parse_dates keyword for easier data processing
data_path = get_file_handle('jfk_weather_cleaned.csv')
data = pd.read_csv(data_path, parse_dates=['DATE'])
# Set date index
data = data.set_index(pd.DatetimeIndex(data['DATE']))
data.drop(['DATE'], axis=1, inplace=True)
data.head()
For purposes of time-series modeling, we will restrict our analysis to a 2-year sample of the dataset to avoid overly long model-training times.
sample = data['2016-01-01':'2018-01-01']
sample.info()
Before we attempt any time-series analysis and prediction, we should split the dataset into training, validation and test sets. We use a portion of the data for training, and a portion of future data for our validation and test sets.
If we instead trained a model on the full dataset, the model would learn to be very good at making predictions on that particular dataset, essentially just copying the answers it knows. However, when presented with data the model has not seen , it would perform poorly since it has not learned how to generalize its answers.
By training on a portion of the dataset and testing the model's performance on another portion of the dataset (which data the model has not seen in training), we try to avoid our models "over-fitting" the dataset and make them better at predicting temperatures given unseen, future data. This process of splitting the dataset and evaluating a model's performance on the validation and test sets is commonly known as cross-validation).
By default here we use 80% of the data for the training set and 10% each for validation and test sets.
def split_data(data, val_size=0.1, test_size=0.1):
"""
Splits data to training, validation and testing parts
"""
ntest = int(round(len(data) * (1 - test_size)))
nval = int(round(len(data) * (1 - test_size - val_size)))
df_train, df_val, df_test = data.iloc[:nval], data.iloc[nval:ntest], data.iloc[ntest:]
return df_train, df_val, df_test
# Create data split
df_train, df_val, df_test = split_data(sample)
print('Total data size: {} rows'.format(len(sample)))
print('Training set size: {} rows'.format(len(df_train)))
print('Validation set size: {} rows'.format(len(df_val)))
print('Test set size: {} rows'.format(len(df_test)))
In this section, we'll create a few simple predictive models of temperature, using shifting and rolling averages. These will serve as a baseline against which we can compare more sophisticated models.
Using values at recent timesteps (such as the most recent timestep t-1
and second-most recent timestep t-2
) to predict the current value at time t
is what's known as persistence modeling, or using the last observed value to predict the next following value. These preceding timesteps are often referred to in time-series analysis as lags
. So, the value at time t-1
is known as the 1st lag
and the value at time t-2
is the 2nd lag
.
We can also create baselines based on rolling (or moving) averages. This is a time-series constructed by averaging each lagged value up to the selected lag. For example, a 6-period (or 6-lag) rolling avearge is the average of the previous 6 hourly lags t-6
to t-1
.
Our baseline models will be:
1st lag
- i.e. values at t-1
2nd lag
- i.e. values at t-2
6-lag
rolling average12-lag
rolling average# define the column containing the data we wish to model - in this case Dry Bulb Temperature (F)
Y_COL = 'dry_bulb_temp_f'
# Use shifting and rolling averages to predict Y_COL (t)
n_in = 2
n_out = 1
features = [Y_COL]
n_features = len(features)
# create the baseline on the entire sample dataset.
# we will evaluate the prediction error on the validation set
baseline = sample[[Y_COL]].loc[:]
baseline['{} (t-1)'.format(Y_COL)] = baseline[Y_COL].shift(1)
baseline['{} (t-2)'.format(Y_COL)] = baseline[Y_COL].shift(2)
baseline['{} (6hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('6H').mean()
baseline['{} (12hr rollavg)'.format(Y_COL)] = baseline[Y_COL].rolling('12H').mean()
baseline.dropna(inplace=True)
baseline.head(10)
Next, we will plot data from our validation dataset to get a sense for how well these baseline models predict the next hourly temperature. Note thatd we only use a few days of data in order to make the plot easier to view.
# plot first 7 days of the validation set, 168 hours
start = df_val.index[0]
end = df_val.index[167]
sliced = baseline[start:end]
# Plot baseline predictions sample
cols = ['dry_bulb_temp_f', 'dry_bulb_temp_f (t-1)', 'dry_bulb_temp_f (t-2)', 'dry_bulb_temp_f (6hr rollavg)', 'dry_bulb_temp_f (12hr rollavg)']
sliced[cols].plot()
plt.legend(['t', 't-1', 't-2', '6hr', '12hr'], loc=2, ncol=3)
plt.title('Baselines for First 7 Days of Validation Set')
plt.ylabel('Temperature (F)')
plt.tight_layout()
plt.rcParams['figure.dpi'] = 100
plt.show()
As you can perhaps see from the graph above, the lagged baselines appear to do a better job of forecasting temperatures than the rolling average baselines.
In order to evaluate our baseline models more precisely, we need to answer the question "how well do our models predict future temperature?". In regression problems involving prediction of a numerical value, we often use a measure of the difference between our predicted value and the actual value. This is referred to as an error measure or error metric. A common measure is the Mean Squared Error (MSE):
\begin{equation} MSE = \frac{1}{n} \sum_{i=1}^{n}{(y_i - \hat y_i)^{2}} \end{equation}This is the average of the squared differences between predicted values $ \hat y $ and actual values $ y $.
Because the MSE is in "units squared" it can be difficult to interpet, hence the Root Mean Squared Error (RMSE) is often used:
\begin{equation} RMSE = \sqrt {MSE} \end{equation}This is the square root of the MSE, and is in the same units as the values $ y $. We can compare the RMSE (and MSE) values for different models and say that the model that has the lower MSE is better at predicting temperatures, all things equal. Note that MSE and RMSE will grow large quickly if the differences between predicted and actual values are large. This may or may not be a desired quality of your error measure. In this case, it is probably a good thing, since a model that makes large mistakes in temperature prediction will be much less useful than one which makes small mistakes.
Next, we calculate the RMSE measure for each of our baseline models, on the full validation set.
# Calculating baseline RMSE
start_val = df_val.index[0]
end_val = df_val.index[-1]
baseline_val = baseline[start_val:end_val]
baseline_y = baseline_val[Y_COL]
baseline_t1 = baseline_val['dry_bulb_temp_f (t-1)']
baseline_t2 = baseline_val['dry_bulb_temp_f (t-2)']
baseline_avg6 = baseline_val['dry_bulb_temp_f (6hr rollavg)']
baseline_avg12 = baseline_val['dry_bulb_temp_f (12hr rollavg)']
rmse_t1 = round(np.sqrt(mean_squared_error(baseline_y, baseline_t1)), 2)
rmse_t2 = round(np.sqrt(mean_squared_error(baseline_y, baseline_t2)), 2)
rmse_avg6 = round(np.sqrt(mean_squared_error(baseline_y, baseline_avg6)), 2)
rmse_avg12 = round(np.sqrt(mean_squared_error(baseline_y, baseline_avg12)), 2)
print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))
print('Baseline t-2 RMSE: {0:.3f}'.format(rmse_t2))
print('Baseline 6hr rollavg RMSE: {0:.3f}'.format(rmse_avg6))
print('Baseline 12hr rollavg RMSE: {0:.3f}'.format(rmse_avg12))
The RMSE results confirm what we saw in the graph above. It is clear that the rolling average baselines perform poorly. In fact, the t-2
lagged baseline is also not very good. It appears that the best baseline model is to simply use the current hour's temperature to predict the next hour's temperature!
Can we do better than this simple baseline using more sophisticated models?
In the previous section, we saw that a simple lag-1
baseline model performed reasonably well at forecasting temperature for the next hourly time step. This is perhaps not too surprising, given what we know about hourly temperatures. Generally, the temperature in a given hour will be quite closely related to the temperature in the previous hour. This phenomenon is very common in time-series analysis and is known as autocorrelation - that is, the time series is correlated
with previous values of itself. More precisely, the values at time t
are correlated with lagged values (which could be t-1
, t-2
and so on).
Another thing we saw previously is the concept of moving averages. In this case the moving-average baseline was not that good at prediction. However it is common in many time-series for a moving average to capture some of the underlying structure and be useful for prediction.
In order to make our model better at predicting temperature, ideally we would want to take these aspects into account. Fortunately, the statistical community has a long history of analyzing time series and has created many different forecasting models.
Here, we will explore one called SARIMAX - the Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors model.
This sounds like a very complex name, but if we look at the components of the name, we see that it includes autocorrelation
(this is what auto regressive means) and moving averages
, which are the components mentioned above.
The SARIMAX model also allows including a seasonal model component as well as handling exogenous variables, which are external to the time-series value itself. For example, for temperature prediction we may wish to take into account not just previous temperature values, but perhaps other weather features which may have an effect on temperature (such as humidity, rainfall, wind, and so on).
For the purposes of this notebook, we will not explore modeling of seasonal components or exogenous variables.
If we drop the "S" and "X" from the model, we are left with an ARIMA model (Auto-regressive Integrated Moving Average). This is a very commonly used model for time-series analysis and we will use it in this notebook by only specifying the relevant model components of the full SARIMAX model.
As a starting point, we will see how we can use SARIMAX to create a simple model that in fact replicates one of the baselines we created previously. Auto-regression, as we have seen, means using values from preceding time periods to predict the current value. Recall that one of our baseline models was the 1st lag
or t-1
model. In time-series analysis this is referred to as an AR(1) model, meaning an Auto-Regressive model for lag 1
.
Technically, the AR(1) model is not exactly the same as our baseline model. A statistical time series model like SARIMAX learns a set of weights
to apply to each component of the model. These weights are set so as to best fit the dataset. We can think of our baseline as setting the weight
for the t-1
lag to be exactly 1
. In practice, our time-series model will not have a weight of exactly 1
(though it will likely be very close to that), hence the predictions will be slightly different.
Now, lets fit our model to the dataset. First, we will set up the model inputs by taking the temperature column of our dataframe. We do this for training and validation sets.
X_train = df_train[Y_COL]
X_val = df_val[Y_COL]
X_both = np.hstack((X_train, X_val))
Here we created a variable called X_both
to cover both the training and validation data. This is required later when we forecast values for our SARIMAX model, in order to give the model access to all the datapoints for which it must create forecasts. Note that the forecasts themselves will only be based on the model weights learned from the training data (this is important for over-fitting as we have seen above)!
The SARIMAX model takes an argument called order
: this specifies the components of the model and itself has 3 parts: (p, d, q)
. p
denotes the lags for the AR model and q
denotes the lags for the MA model. We will not cover the d
parameter here. Taken together this specifies the parameters of the ARIMA model portion of SARIMAX.
To create an AR(1) model, we set the order
to be (1, 0, 0)
. This sets up the AR model to be a lag 1
model. Then, we fit our model on the training data and inspect a summary of the trained model.
order = (1, 0, 0)
model_ar1 = SARIMAX(X_train, order=order)
results_ar1 = model_ar1.fit()
results_ar1.summary()
There's quite a lot of information printed out in the model summary above. Much of it is related to the statistical properties of our model.
The most important thing for now is to look at the second table, where we can see a coef
value of 0.9996
for the weight ar.L1
. This tells us the model has set a weight for the 1st lag
component of the AR model to be 0.9996
. This is almost 1
and hence we should expect the prediction results to indeed be close to our t-1
baseline.
Let's create our model forecast on the validation dataset. We will then plot a few data points like we did with our baseline models (using 7 days of validation data) and compute the RMSE value based on the full validation set.
full_data_ar1 = SARIMAX(X_both, order=order)
model_forecast_ar1 = full_data_ar1.filter(results_ar1.params)
start = len(X_train)
end = len(X_both)
forecast_ar1 = model_forecast_ar1.predict(start=start, end=end - 1, dynamic=False)
# plot actual vs predicted values for the same 7-day window for easier viewing
plt.plot(sliced[Y_COL].values)
plt.plot(forecast_ar1[:168], color='r', linestyle='--')
plt.legend(['t', 'AR(1)'], loc=2)
plt.title('AR(1) Model Predictions for First 7 Days of Validation Set')
plt.ylabel('Temperature (F)')
plt.tight_layout()
plt.show()
We can see that the plot looks almost identical to the plot above, for the t
and t-1 baseline
values.
Next, we compute the RMSE values.
# compute print RMSE values
rmse_ar1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar1))
print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))
print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))
We can see that the RMSE values for the validation set also almost identical.
One of our baseline models was a lag 2
model, i.e. t-2
. We saw that it performed a lot worse than the t-1
baseline. Intuitively, this makes sense, since we are throwing away a lot of information about the most recent lag t-1
. However, the t-2
lag still provides some useful information. In fact, for temperature prediction it's likely that the last few hours can provide some value.
Fortunately, our ARIMA model framework provides an easy way to incorporate further lag information. We can construct a model that includes both the t-1
and t-2
lags. This is an AR(2) model (meaning an auto-regressive model up to lag 2
). We can specify this with the model order parameter p=2
.
order = (2, 0, 0)
model_ar2 = SARIMAX(X_train, order=order)
results_ar2 = model_ar2.fit()
results_ar2.summary()
This time, the results table indicates a weight for variable ar.L1
and ar.L2
. Note the values are now quite different from 1
(or 0.5
say, for a simple equally-weighted model). Next, we compute the RMSE on the validation set.
full_data_ar2 = SARIMAX(X_both, order=order)
model_forecast_ar2 = full_data_ar2.filter(results_ar2.params)
start = len(X_train)
end = len(X_both)
forecast_ar2 = model_forecast_ar2.predict(start=start, end=end - 1, dynamic=False)
# compute print RMSE values
rmse_ar2 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2))
print('AR(2) RMSE: {0:.3f}'.format(rmse_ar2))
print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))
print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))
We've improved the RMSE value by including information from the first two lags.
In fact, you will see that if you continue to increase the p
parameter value, the RMSE will continue to decrease, indicating that a few recent lags provide useful information to our model.
Finally, what if we also include moving average information in our model? The ARIMA framework makes this easy to do, by setting the order parameter q
. A value of q=1
specifies a MA(1) model (including the first lag t-1
), while q=6
would include all the lags from t-1
to t-6
.
Note that the moving average model component is a little different from the simple moving or rolling averages computed in the baseline models. The definition of the MA model is rather technical, but conceptually you can think of it as using a form of weighted moving average (compared to our baseline which would be a simple, unweighted average).
Let's add an MA(1) component to our AR(2) model.
order = (2, 0, 1)
model_ar2ma1 = SARIMAX(X_train, order=order)
results_ar2ma1 = model_ar2ma1.fit()
results_ar2ma1.summary()
We see the results table shows an additional weight value for ma.L1
, our MA(1) component. Next, we compare the RMSE to the other models and finally plot all the model forecasts together - note we use a much smaller 48-hour window to make the plot readable for illustrative purposes.
full_data_ar2ma1 = SARIMAX(X_both, order=order)
model_forecast_ar2ma1 = full_data_ar2ma1.filter(results_ar2ma1.params)
start = len(X_train)
end = len(X_both)
forecast_ar2ma1 = model_forecast_ar2ma1.predict(start=start, end=end - 1, dynamic=False)
# compute print RMSE values
rmse_ar2ma1 = np.sqrt(mean_squared_error(baseline_val[Y_COL], forecast_ar2ma1))
print('AR(2) MA(1) RMSE: {0:.3f}'.format(rmse_ar2ma1))
print('AR(2) RMSE: {0:.3f}'.format(rmse_ar2))
print('AR(1) RMSE: {0:.3f}'.format(rmse_ar1))
print('Baseline t-1 RMSE: {0:.3f}'.format(rmse_t1))
# plot actual vs predicted values for a smaller 2-day window for easier viewing
hrs = 48
plt.plot(sliced[Y_COL][:hrs].values)
plt.plot(forecast_ar1[:hrs], color='r', linestyle='--')
plt.plot(forecast_ar2[:hrs], color='g', linestyle='--')
plt.plot(forecast_ar2ma1[:hrs], color='c', linestyle='--')
plt.legend(['t', 'AR(1)', 'AR(2)', 'AR(2) MA(1)'], loc=2, ncol=1)
plt.title('ARIMA Model Predictions for First 48 hours of Validation Set')
plt.ylabel('Temperature (F)')
plt.tight_layout()
plt.show()
We've again managed to reduce the RMSE value for our model, indicating that adding the MA(1) component has improved our forecast!
Congratulations! You've applied the basics of time-series analysis for forecasting hourly temperatures. See if you can further improve the RMSE values by exploring the different values for the model parameters p
, q
and even d
!
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.